Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

)

SARS-CoV-2 sequences (greater than 0.02). Figure 7.13 shows the partial multiple sequence alignment using msa, where ten SARS-CoV-2 sequences were exactly clustered together.

Fig. 7.13. The msa result for 17 genome sequences.

The alignment-based approach versus the alignment-free approach for sequence comparison

Because the alignment-free multiple sequence comparison approach can support genome pattern discovery, it is interesting to compare the alignment-free approach with the alignment-based approach in three aspects, i.e., the speed, the accuracy and the pattern discovery power.

The speed comparison

Whether the alignment-free sequence comparison approach saves time significantly was examined based on a 3-mer word data set. Pseudo nucleotide sequences were generated. Their lengths ranged from 1,000 to 10,000. The mutation rate on one sequence was 10%.
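The simulation step can be sketched in base R as follows. This is only a minimal illustration under stated assumptions, not the code used to produce the results in this chapter: the helper names `random_seq` and `mutate_seq` are hypothetical, mutations are modelled as substitutions only, and the length grid (steps of 1,000) is assumed.

```r
# Hypothetical helpers; the substitution-only mutation model and the
# length grid below are assumptions made for illustration.
set.seed(1)  # fixed seed so the sketch is reproducible
bases <- c("A", "C", "G", "T")

# Draw a random nucleotide sequence of a given length (character vector).
random_seq <- function(len) sample(bases, len, replace = TRUE)

# Replace a fraction `rate` of positions with a different base.
mutate_seq <- function(s, rate = 0.10) {
  n_mut <- round(length(s) * rate)
  pos <- sample(length(s), n_mut)  # distinct positions to mutate
  for (p in pos) {
    s[p] <- sample(setdiff(bases, s[p]), 1)  # always a different base
  }
  s
}

# One original/mutated pair for each sequence length.
lengths <- seq(1000, 10000, by = 1000)
pairs <- lapply(lengths, function(len) {
  original <- random_seq(len)
  list(original = original, mutated = mutate_seq(original))
})

# Fraction of positions that differ within the first pair.
mean(pairs[[1]]$original != pairs[[1]]$mutated)
```

Because each mutated position is forced to take a different base, the fraction of differing positions in a pair equals the requested mutation rate.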

The Needleman-Wunsch algorithm was applied for the homology alignment between them, i.e., the alignment-based sequence comparison.
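To make the cost of the alignment-based comparison concrete, the following is a minimal sketch of the Needleman-Wunsch dynamic programming recursion in base R. It computes the global alignment score only (no traceback). The function name `nw_score` is hypothetical, and the scoring scheme (match = 1, mismatch = -1, gap = -1) is an assumption, not necessarily the one used in this experiment.

```r
# Minimal Needleman-Wunsch global alignment score (score only, no traceback).
# The scoring values are assumptions chosen for illustration.
nw_score <- function(a, b, match = 1, mismatch = -1, gap = -1) {
  n <- length(a)
  m <- length(b)
  # score[i + 1, j + 1] is the best score for aligning a[1..i] with b[1..j]
  score <- matrix(0, nrow = n + 1, ncol = m + 1)
  score[, 1] <- (0:n) * gap  # a prefix aligned against gaps only
  score[1, ] <- (0:m) * gap  # b prefix aligned against gaps only
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      s <- if (a[i] == b[j]) match else mismatch
      score[i + 1, j + 1] <- max(
        score[i, j] + s,        # align a[i] with b[j]
        score[i, j + 1] + gap,  # a[i] aligned with a gap
        score[i + 1, j] + gap   # b[j] aligned with a gap
      )
    }
  }
  score[n + 1, m + 1]
}

# Small check: ACGT versus ACT aligns as ACGT / AC-T (three matches, one gap).
nw_score(c("A", "C", "G", "T"), c("A", "C", "T"))

# CPU time for one pair of short random sequences (length kept small here
# because the nested loops are slow in pure R).
set.seed(1)
a <- sample(c("A", "C", "G", "T"), 300, replace = TRUE)
b <- sample(c("A", "C", "G", "T"), 300, replace = TRUE)
timing <- system.time(nw_score(a, b))
timing[["user.self"]]
```

The nested loops fill an (n + 1) by (m + 1) matrix, so both time and memory grow with the product of the two sequence lengths. This quadratic growth is what makes the alignment-based comparison expensive for long sequences.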

The 3-mer word frequency was calculated for each sequence. The pairwise distances between sequences were calculated based on the word frequency vectors. The CPU time was recorded. Figure 7.14(a) shows the comparison. It can be seen that the CPU time of the alignment-free